Learning Derived Words from Medical Corpora
Authors
Abstract
Morphological knowledge (inflection, derivation, compounding) is useful for medical language processing. Some of this knowledge is available for medical English in the UMLS Specialist Lexicon, but not for French. Large corpora of medical texts can now be obtained from the Web. We propose a method, based on the cooccurrence of formally similar words, that uses such a corpus to learn morphological knowledge for French medical words. The relations obtained before filtering have an average precision of 75.6% after 5,000 word pairs. Detailed examination of the results obtained on a sample of 376 French SNOMED anatomy nouns shows that 91–94% of the proposed derived adjectives are correct, that 36% of the nouns receive a correct adjective, and that this method can add 41% more derived adjectives than SNOMED already specifies. We discuss these results and propose directions for improvement.
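The core idea in the abstract, pairing formally similar words that cooccur in the same text, can be illustrated with a minimal sketch. The abstract does not give the actual similarity criterion or the filtering step, so the code assumes, purely for illustration, that "formally similar" means sharing an initial string of at least `min_prefix` characters and that cooccurrence is counted per document. The function names, the threshold, and the toy documents are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest initial substring shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def cooccurring_similar_pairs(documents, min_prefix=4):
    """Count, for each pair of formally similar words, the documents containing both.

    documents: iterable of token lists (one list per document).
    min_prefix: assumed similarity threshold, not taken from the paper.
    """
    counts = defaultdict(int)
    for tokens in documents:
        # Each word is considered once per document.
        vocab = sorted({t.lower() for t in tokens})
        for a, b in combinations(vocab, 2):
            if common_prefix_len(a, b) >= min_prefix:
                counts[(a, b)] += 1
    return dict(counts)


# Toy French documents (hypothetical data).
docs = [
    ["abdomen", "abdominal", "coeur"],
    ["abdomen", "abdominal", "veine", "veineux"],
    ["veine", "coeur"],
]
pairs = cooccurring_similar_pairs(docs)
print(pairs)
```

Pairs with higher cooccurrence counts would be the more plausible candidates for a morphological relation such as noun to derived adjective. A realistic version would also need to handle accented variants and filter spurious pairs, which this sketch does not attempt.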
Similar resources
Comparative Study of the Academic Vocabulary Content of Electronic Engineering Corpora, GE Materials and M.S. Entrance Examinations
The importance of vocabulary learning has been underlined in the field of English for Academic Purposes (EAP) because non-English majors who require reading English texts in their fields of study have to expand their English vocabulary knowledge much more efficiently than ordinary ESL/EFL learners. Since academic vocabulary instruction in Iranian universities is realized through the use of Gene...
SVitchboard II and fiSVer i: high-quality limited-complexity corpora of conversational English speech
In this paper, we introduce a set of benchmark corpora of conversational English speech derived from the Switchboard-I and Fisher datasets. Traditional ASR research requires considerable computational resources and has slow experimental turnaround times. Our goal is to introduce these new datasets to researchers in the ASR and machine learning communities (especially in academia), in order to f...
Learning Method for Automatic Acquisition of Translation Knowledge
This paper presents a new learning method for automatic acquisition of translation knowledge from parallel corpora. We apply this learning method to automatic extraction of bilingual word pairs from parallel corpora. In general, similarity measures are used to extract bilingual word pairs from parallel corpora. However, similarity measures are insufficient because of the sparse data problem. Th...
Machine Learning Comprehension Grammars for Ten Languages
Comprehension grammars for a sample of ten languages (English, Dutch, German, French, Spanish, Catalan, Russian, Chinese, Korean, and Japanese) were derived by machine learning from corpora of about 400 sentences. Key concepts in our learning theory are: probabilistic association of words and meanings, grammatical and semantical form generalization, grammar computations, congruence of meaning, ...
A new model for Persian multi-part words edition based on statistical machine translation
Multi-part words in English are hyphenated, with the hyphen separating the parts. Persian has multi-part words as well. According to Persian morphology, a half-space character is needed to separate the parts of a multi-part word, but in many cases people incorrectly use a space character instead. This common incorrect use of the space leads to some s...